Abstract

Networks arising from social, technological 
and natural domains exhibit rich connectiv- 
ity patterns and nodes in such networks are 
often labeled with attributes or features. We 
address the question of modeling the struc- 
ture of networks where nodes have attribute 
information. We present a Multiplicative At- 
tribute Graph (MAG) model that considers 
nodes with categorical attributes and models 
the probability of an edge as the product of 
individual attribute link formation affinities. 
We develop a scalable variational expectation 
maximization parameter estimation method. 
Experiments show that the MAG model reliably captures network connectivity and provides insights into how different attributes shape the network structure.

1 Introduction 

Social and biological systems can be modeled as interaction networks where nodes and edges represent entities and interactions. Viewing real systems as networks has led to the discovery of underlying organizational principles [31 [TB] as well as to high impact applications [14]. As organizational principles of networks are discovered, the following questions arise: Why are networks organized the way they are? How can we model this?

Network modeling has a rich history and can be roughly divided into two streams. The first are the explanatory "mechanistic" models [3 [12] that posit simple generative mechanisms that lead to networks with realistic connectivity patterns. For example, the Copying model [7] states a simple rule where a new node joins the network, randomly picks an existing node and links to some of its neighbors. One can prove that under this generative mechanism networks with power-law degree distributions naturally emerge. The second line of work comprises statistical models of network structure [TJ HI (TBI [IT], which are usually accompanied by model parameter estimation procedures and have proven to be useful for hypothesis testing. However, such models are often analytically intractable, as they do not lend themselves to mathematical analysis of structural properties of networks that emerge from the models.

Recently a new line of work [TSldH] has emerged. It develops network models that are analytically tractable, in the sense that one can mathematically analyze structural properties of networks that emerge from the models, as well as statistically meaningful, in the sense that there exist efficient parameter estimation techniques. For instance, the Kronecker graphs model [10] can be proved mathematically to give rise to networks with a small diameter, a giant connected component, and so on [131 E]. It can also be fitted to real networks [11] to reliably mimic their structure.

However, the above models focus only on modeling the network structure and do not consider information about properties of the nodes of the network. Often nodes have features or attributes associated with them. The question is then how to characterize and model the interactions between the node properties and the network structure. For instance, users in an online social network have profile information like age and gender, and we are interested in modeling how these attributes interact to give rise to the observed network structure.

We present the Multiplicative Attribute Graphs (MAG) model that naturally captures interactions between the node attributes and the observed network structure. The model considers nodes with categorical attributes, and the probability of an edge between a pair of nodes depends on the individual attribute link formation affinities. The MAG model is analytically tractable in the sense that we can prove that networks arising from the model exhibit connectivity patterns that are also found in real-world networks [5]. For example, networks arising from the model have heavy-tailed degree distributions, small diameter and a unique giant connected component [5]. Moreover, the MAG model captures homophily (i.e., tendency to link to similar others) as well as heterophily (i.e., tendency to link to different others) of different node attributes.

In this paper we develop MagFit, a scalable parameter estimation method for the MAG model. We start by defining the generative interpretation of the model and then cast the model parameter estimation as a maximum likelihood problem. Our approach is based on the variational expectation maximization framework and scales to large networks. Experiments on several real-world networks demonstrate that the MAG model reliably captures the network connectivity patterns and outperforms present state-of-the-art methods. Moreover, the model parameters have a natural interpretation and provide additional insights into how node attributes shape the structure of networks.

2 Multiplicative Attribute Graphs 

The Multiplicative Attribute Graphs model (MAG) [5] is a class of generative models for networks with node attributes. MAG combines categorical node attributes with their affinities to compute the probability of a link. For example, some node attributes (e.g., political affiliation) may have positive affinities in the sense that sharing the same political view increases the probability of being linked (i.e., homophily), while other attributes may have negative affinities, i.e., people are more likely to link to others with a different value of that attribute.

Formally, we consider a directed graph $A$ (represented by its binary adjacency matrix) on $N$ nodes. Each node $i$ has $L$ categorical attributes, $F_{i1}, \ldots, F_{iL}$, and each attribute $l$ ($l = 1, \ldots, L$) is associated with an affinity matrix $\Theta_l$ which quantifies the affinity of the attribute to form a link. Each entry $\Theta_l[k, k'] \in (0, 1)$ of the affinity matrix indicates the potential for a pair of nodes to form a link, given the $l$-th attribute value $k$ of the first node and value $k'$ of the second node. For a given pair of nodes, their attribute values "select" proper entries of affinity matrices, i.e., the first node's attribute value selects a "row" while the second node's attribute value selects a "column". The link probability is then defined as the product of the selected entries of affinity matrices. Each edge $(i, j)$ is then included in the graph $A$ independently with probability $p_{ij}$:

$$p_{ij} := P(A_{ij} = 1) = \prod_{l=1}^{L} \Theta_l[F_{il}, F_{jl}]. \qquad (1)$$
Figure 1: Multiplicative Attribute Graph (MAG) model. Each node $i$ has categorical attribute vector $F_i$. The probability $p_{ij}$ of edge $(i, j)$ is then determined by attributes "selecting" the appropriate entries of attribute affinity matrices $\Theta_l$.

Figure 1 illustrates the model. Nodes $i$ and $j$ have the binary attribute vectors $[0,0,1,0]$ and $[0,1,1,0]$, respectively. We then select the entries of the attribute affinity matrices, $\Theta_1[0,0]$, $\Theta_2[0,1]$, $\Theta_3[1,1]$, and $\Theta_4[0,0]$, and compute the link probability $p_{ij}$ of link $(i, j)$ as a product of these selected entries.
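The selection-and-product rule of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `link_probability` and the numeric affinity values are assumptions made for the example; only the attribute vectors come from the text.

```python
from math import prod

def link_probability(f_i, f_j, thetas):
    """Eq. (1): multiply, over attributes l, the entry thetas[l][f_i[l]][f_j[l]]."""
    return prod(theta[a][b] for theta, a, b in zip(thetas, f_i, f_j))

# Hypothetical 2 x 2 affinity matrix, reused for all four attributes.
# The values are illustrative only and do not come from the paper.
theta = [[0.9, 0.5],
         [0.5, 0.2]]
thetas = [theta, theta, theta, theta]

f_i = [0, 0, 1, 0]  # attribute vector of node i (from the text)
f_j = [0, 1, 1, 0]  # attribute vector of node j (from the text)

# Selected entries: theta[0][0], theta[0][1], theta[1][1], theta[0][0]
p_ij = link_probability(f_i, f_j, thetas)
```

Node $i$'s value picks the row and node $j$'s value picks the column, so the probability is generally not symmetric in $i$ and $j$ unless every affinity matrix is symmetric.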

Kim & Leskovec [5] proved that the MAG model captures connectivity patterns observed in real-world networks, such as heavy-tailed (power-law or log-normal) degree distributions, small diameters, a unique giant connected component and local clustering of the edges. They provided both analytical and empirical evidence demonstrating that the MAG model effectively captures the structure of real-world networks.

The MAG model can handle attributes of any cardinality. However, for simplicity we limit our discussion to binary attributes. Thus, every $F_{il}$ takes a value of either 0 or 1, and every $\Theta_l$ is a $2 \times 2$ matrix.

Model parameter estimation. So far we have seen how, given the node attributes $F$ and the corresponding attribute affinity matrices $\Theta$, we generate a MAG network. Now we focus on the reverse problem: given a network $A$ and the number of attributes $L$, we aim to estimate affinity matrices $\Theta$ and node attributes $F$.

In other words, we aim to represent the given real network $A$ in the form of the MAG model parameters: node attributes $F = \{F_{il};\ i = 1, \ldots, N,\ l = 1, \ldots, L\}$ and attribute affinity matrices $\Theta = \{\Theta_l;\ l = 1, \ldots, L\}$. Since MAG yields a probabilistic adjacency matrix that independently assigns the link probability to every pair of nodes, the likelihood $P(A|F, \Theta)$ of a given graph (adjacency matrix) $A$ is the product of the edge probabilities over the edges and non-edges of the network:

$$P(A|F, \Theta) = \prod_{A_{ij}=1} p_{ij} \prod_{A_{ij}=0} (1 - p_{ij}) \qquad (2)$$

where $p_{ij}$ is defined in Eq. (1).
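As a sketch of how the likelihood in Eq. (2) could be evaluated for known attributes, the snippet below sums log-probabilities over all ordered node pairs. Skipping the diagonal (no self-loops) is an assumption of this example, as are the function names and the toy values.

```python
import math

def link_probability(f_i, f_j, thetas):
    """Eq. (1): product of the selected affinity entries."""
    return math.prod(theta[a][b] for theta, a, b in zip(thetas, f_i, f_j))

def log_likelihood(adj, attrs, thetas):
    """log P(A | F, Theta) from Eq. (2), over ordered pairs i != j (assumption)."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: self-loops are not modeled
            p = link_probability(attrs[i], attrs[j], thetas)
            total += math.log(p) if adj[i][j] else math.log(1.0 - p)
    return total

# Toy instance with hypothetical values: two nodes, one binary attribute.
thetas = [[[0.9, 0.5], [0.5, 0.2]]]
attrs = [[0], [1]]
adj = [[0, 1],
       [0, 0]]
ll = log_likelihood(adj, attrs, thetas)
```

Working in log space avoids numerical underflow, since the product in Eq. (2) has on the order of $N^2$ factors.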
Now we can use maximum likelihood estimation to find node attributes $F$ and their affinity matrices $\Theta$. Hence, ideally we would like to solve

$$\arg\max_{F, \Theta} P(A|F, \Theta). \qquad (3)$$
Figure 2: MAG model: Node attributes $F_{il}$ are sampled from $\mu_l$ and combined with affinity matrices $\Theta_l$ to generate a probabilistic adjacency matrix $P$.

However, there are several challenges with this problem formulation. First, notice that Eq. (3) is a combinatorial problem of $O(LN)$ categorical variables even when the affinity matrices are fixed. Finding both $F$ and $\Theta$ simultaneously is even harder. Second, even if we could solve this combinatorial problem, the model has a lot of parameters, which may cause high variance.

To resolve these challenges, we consider a simple generative model for the node attributes. We assume that the $l$-th attribute of each node is drawn from an i.i.d. Bernoulli distribution parameterized by $\mu_l$. This means that the $l$-th attribute of every node takes value 1 with probability $\mu_l$, i.e., $F_{il} \sim \text{Bernoulli}(\mu_l)$.

Figure 2 illustrates the model in plate notation. First, node attributes $F_{il}$ are generated by the corresponding Bernoulli distributions $\mu_l$. By combining these node attributes with the affinity matrices $\Theta_l$, the probabilistic adjacency matrix $P$ is formed. Network $A$ is then generated by a series of coin flips where each edge $A_{ij}$ appears with probability $P_{ij}$.
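The generative process just described can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name `sample_mag`, the seed handling, and the exclusion of self-loops are choices made for the example, not details taken from the paper.

```python
import random

def sample_mag(n, mus, thetas, seed=0):
    """Sample attributes F_il ~ Bernoulli(mu_l), then flip one coin per ordered pair."""
    rng = random.Random(seed)
    attrs = [[1 if rng.random() < mu else 0 for mu in mus] for _ in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-loops
            p = 1.0
            for theta, a, b in zip(thetas, attrs[i], attrs[j]):
                p *= theta[a][b]  # entry P_ij of the probabilistic adjacency matrix
            adj[i][j] = 1 if rng.random() < p else 0
    return attrs, adj

# Hypothetical parameters: L = 2 attributes sharing one affinity matrix.
theta = [[0.9, 0.5], [0.5, 0.2]]
attrs, adj = sample_mag(5, [0.5, 0.5], [theta, theta], seed=1)
```

This direct sampler takes $O(LN^2)$ time, which is fine for illustration but not for large networks.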

Even this simplified model provably generates networks with power-law degree distributions, small diameter, and a unique giant component [5]. The simplified model requires only $5L$ parameters (4 per each $\Theta_l$, 1 per $\mu_l$). Note that the number of attributes $L$ can be thought of as constant or slowly increasing in the number of nodes $N$ (e.g., $L = O(\log N)$) |2l[5].

The generative model for node attributes slightly modifies the objective function in Eq. (3). We maintain the maximum likelihood approach, but instead of directly finding attributes $F$ we now estimate parameters $\mu_l$ that then generate latent node attributes $F$.

We denote the log-likelihood $\log P(A|\mu, \Theta)$ as $\mathcal{L}(\mu, \Theta)$ and aim to find $\mu = \{\mu_l\}$ and $\Theta = \{\Theta_l\}$ by maximizing

$$\mathcal{L}(\mu, \Theta) = \log P(A|\mu, \Theta) = \log \sum_F P(A, F|\mu, \Theta).$$

Note that since $\mu$ and $\Theta$ are linked through $F$, we have to sum over all possible instantiations of node attributes $F$. Since $F$ consists of $L \cdot N$ binary variables, the number of all possible instantiations of $F$ is $O(2^{LN})$, which makes computing $\mathcal{L}(\mu, \Theta)$ directly intractable. In the next section we will show how to quickly (but approximately) compute the summation.

To compute the likelihood $P(A, F|\mu, \Theta)$, we have to consider the likelihood of node attributes. Note that each edge $A_{ij}$ is independent given the attributes $F$ and each attribute $F_{il}$ is independent given the parameters $\mu_l$. By this conditional independence and the fact that both $A_{ij}$ and $F_{il}$ follow Bernoulli distributions with parameters $p_{ij}$ and $\mu_l$, we obtain

$$\begin{aligned}
P(A, F|\mu, \Theta) &= P(A|F, \mu, \Theta)\, P(F|\mu, \Theta) \\
&= P(A|F, \Theta)\, P(F|\mu) \\
&= \prod_{A_{ij}=1} p_{ij} \prod_{A_{ij}=0} (1 - p_{ij}) \prod_{F_{il}=1} \mu_l \prod_{F_{il}=0} (1 - \mu_l) \qquad (4)
\end{aligned}$$

where $p_{ij}$ is defined in Eq. (1).

3 MAG Parameter Estimation 

Now, given a network $A$, we aim to estimate the parameters $\mu_l$ of the node attribute model as well as the attribute affinity matrices $\Theta_l$. We regard the actual node attribute values $F$ as latent variables and use the expectation maximization framework.

We present an approximate method to solve the problem by developing a variational Expectation-Maximization (EM) algorithm. We first derive the lower bound $\mathcal{L}_Q(\mu, \Theta)$ on the true log-likelihood $\mathcal{L}(\mu, \Theta)$ by introducing the variational distribution $Q(F)$ parameterized by variational parameters $\phi$. Then, we indirectly maximize $\mathcal{L}(\mu, \Theta)$ by maximizing its lower bound $\mathcal{L}_Q(\mu, \Theta)$. In the E-step, we estimate $Q(F)$ by maximizing $\mathcal{L}_Q(\mu, \Theta)$ over the variational parameters $\phi$. In the M-step, we maximize the lower bound $\mathcal{L}_Q(\mu, \Theta)$ over the MAG model parameters ($\mu$ and $\Theta$) to approximately maximize the actual log-likelihood $\mathcal{L}(\mu, \Theta)$. We alternate between E- and M-steps until the parameters converge.

Variational EM. Next we introduce the distribution $Q(F)$ parameterized by variational parameters $\phi$. The idea is to define an easy-to-compute $Q(F)$ that allows us to compute the lower bound $\mathcal{L}_Q(\mu, \Theta)$ of the true log-likelihood $\mathcal{L}(\mu, \Theta)$. Then, instead of maximizing the hard-to-compute $\mathcal{L}$, we maximize $\mathcal{L}_Q$.

We now show that in order to make the gap between the lower bound $\mathcal{L}_Q$ and the original log-likelihood $\mathcal{L}$ small, we should find an easy-to-compute $Q(F)$ that closely approximates $P(F|A, \mu, \Theta)$. For now we keep $Q(F)$ abstract and precisely define it later.

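We begin by computing the lower bound $\mathcal{L}_Q$ in terms of $Q(F)$. We plug $Q(F)$ into $\mathcal{L}(\mu, \Theta)$ as follows:

$$\begin{aligned}
\mathcal{L}(\mu, \Theta) &= \log \sum_F P(A, F|\mu, \Theta) \\
&= \log \sum_F Q(F) \frac{P(A, F|\mu, \Theta)}{Q(F)} \\
&= \log \mathbb{E}_Q\left[\frac{P(A, F|\mu, \Theta)}{Q(F)}\right]. \qquad (5)
\end{aligned}$$

As $\log x$ is a concave function, by Jensen's inequality,

$$\log \mathbb{E}_Q\left[\frac{P(A, F|\mu, \Theta)}{Q(F)}\right] \geq \mathbb{E}_Q\left[\log \frac{P(A, F|\mu, \Theta)}{Q(F)}\right].$$

Therefore, by taking

$$\mathcal{L}_Q(\mu, \Theta) = \mathbb{E}_Q\left[\log P(A, F|\mu, \Theta) - \log Q(F)\right], \qquad (6)$$

$\mathcal{L}_Q(\mu, \Theta)$ becomes the lower bound on $\mathcal{L}(\mu, \Theta)$.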

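Now the question is how to set $Q(F)$ so that we make the gap between $\mathcal{L}_Q$ and $\mathcal{L}$ as small as possible. The lower bound $\mathcal{L}_Q$ is tight when the proposal distribution $Q(F)$ becomes close to the true posterior distribution $P(F|A, \mu, \Theta)$ in the KL divergence. More precisely, since $P(A|\mu, \Theta)$ is independent of $F$, $\mathcal{L}(\mu, \Theta) = \log P(A|\mu, \Theta) = \mathbb{E}_Q[\log P(A|\mu, \Theta)]$. Thus, the gap between $\mathcal{L}$ and $\mathcal{L}_Q$ is

$$\begin{aligned}
\mathcal{L}(\mu, \Theta) - \mathcal{L}_Q(\mu, \Theta)
&= \log P(A|\mu, \Theta) - \mathbb{E}_Q\left[\log P(A, F|\mu, \Theta) - \log Q(F)\right] \\
&= \mathbb{E}_Q\left[\log P(A|\mu, \Theta) - \log P(A, F|\mu, \Theta) + \log Q(F)\right] \\
&= \mathbb{E}_Q\left[\log Q(F) - \log P(F|A, \mu, \Theta)\right],
\end{aligned}$$

where the last step uses $P(A, F|\mu, \Theta) = P(F|A, \mu, \Theta)\, P(A|\mu, \Theta)$. This means that the gap between $\mathcal{L}$ and $\mathcal{L}_Q$ is exactly the KL divergence between the proposal distribution $Q(F)$ and the true posterior distribution $P(F|A, \mu, \Theta)$.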
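The identity between the gap and the KL divergence can be checked numerically on a model small enough to enumerate. The sketch below assumes a toy instance with $N = 2$ nodes and $L = 1$ attribute; all numeric values and helper names (`joint`, `q`) are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

theta = [[0.9, 0.5], [0.5, 0.2]]  # hypothetical affinity matrix
mu = 0.4                          # hypothetical Bernoulli parameter
adj = [[0, 1], [0, 0]]            # observed toy graph: single edge 0 -> 1
phi = [0.3, 0.6]                  # arbitrary variational parameters

def joint(f):
    """P(A, F | mu, Theta) for the toy model, following Eq. (4)."""
    p = 1.0
    for i, j in ((0, 1), (1, 0)):
        p_ij = theta[f[i]][f[j]]
        p *= p_ij if adj[i][j] else 1.0 - p_ij
    for f_i in f:
        p *= mu if f_i else 1.0 - mu
    return p

def q(f):
    """Fully factorized Q(F) with Bernoulli factors phi."""
    out = 1.0
    for f_i, phi_i in zip(f, phi):
        out *= phi_i if f_i else 1.0 - phi_i
    return out

configs = list(product((0, 1), repeat=2))  # all 2^(L*N) = 4 instantiations of F
evidence = sum(joint(f) for f in configs)  # P(A | mu, Theta)
log_lik = math.log(evidence)
lower_bound = sum(q(f) * (math.log(joint(f)) - math.log(q(f))) for f in configs)
kl = sum(q(f) * (math.log(q(f)) - math.log(joint(f) / evidence)) for f in configs)
gap = log_lik - lower_bound
```

Here `gap` and `kl` agree up to floating-point error and are non-negative, so any $Q$ closer to the posterior tightens the bound.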

Now we know how to choose $Q(F)$ to make the gap small. We want a $Q(F)$ that is easy to compute and at the same time closely approximates $P(F|A, \mu, \Theta)$. We propose the following $Q(F)$ parameterized by $\phi$:

$$\begin{aligned}
F_{il} &\sim \text{Bernoulli}(\phi_{il}) \\
Q_{il}(F_{il}) &= \phi_{il}^{F_{il}} (1 - \phi_{il})^{1 - F_{il}} \\
Q(F) &= \prod_{i,l} Q_{il}(F_{il}) \qquad (7)
\end{aligned}$$

where $\phi = \{\phi_{il}\}$ are variational parameters and $F = \{F_{il}\}$. Our $Q(F)$ has several advantages. First, the computation of $\mathcal{L}_Q$ for fixed model parameters $\mu$ and $\Theta$ is tractable because $\log P(A, F|\mu, \Theta) - \log Q(F)$ in Eq. (6) is separable in terms of $F_{il}$. This means that we are able to update each $\phi_{il}$ in turn to maximize $\mathcal{L}_Q$ by fixing all the other parameters: $\mu$, $\Theta$, and all $\phi$ except the given $\phi_{il}$. Furthermore, since each $\phi_{il}$ represents the approximate posterior distribution of $F_{il}$ given the network, we can estimate each attribute $F_{il}$ by $\phi_{il}$.



Regularization by mutual information. In or- 
der to improve the robustness of MAG parameter es- 
timation procedure, we enforce that each attribute is 
independent of others. The maximum likelihood es- 
timation cannot guarantee the independence between 
the node attributes and so the solution might converge 
to local optima where the attributes are correlated. 
To prevent this, we add a penalty term that aims to 
minimize the mutual information (i.e., maximize the 
entropy) between pairs of attributes. 

Since the distribution for each attribute $F_{il}$ is defined by $\phi_{il}$, we define the mutual information between a pair of attributes in terms of $\phi$. We denote this mutual information as $MI(F) = \sum_{l \neq l'} MI_{ll'}$, where $MI_{ll'}$ represents the mutual information between the attributes $l$ and $l'$. We then regularize the log-likelihood with the mutual information term. We arrive at the following MagFit optimization problem that we actually solve:

$$\arg\max_{\phi, \mu, \Theta} \; \mathcal{L}_Q(\mu, \Theta) - \lambda \sum_{l \neq l'} MI_{ll'}. \qquad (8)$$



We can quickly compute the mutual information $MI_{ll'}$ between attributes $l$ and $l'$. Let $F_{\cdot l}$ denote a random variable representing the value of attribute $l$. Then, the probability $p_l(x) = P(F_{\cdot l} = x)$ that attribute $l$ takes value $x$ is computed by averaging $Q_{il}(x)$ over $i$. Similarly, the joint probability $p_{ll'}(x, y) = P(F_{\cdot l} = x, F_{\cdot l'} = y)$ of attributes $l$ and $l'$ taking values $x$ and $y$ can be computed given $Q(F)$. We compute $MI_{ll'}$ using $Q_{il}$ defined in Eq. (7) as follows:

$$MI_{ll'} = \sum_{x, y \in \{0, 1\}} p_{ll'}(x, y) \log \frac{p_{ll'}(x, y)}{p_l(x)\, p_{l'}(y)}. \qquad (9)$$
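A minimal sketch of the mutual information computation is given below. The marginal is the average of $Q_{il}(x)$ over nodes, as stated in the text. The joint is taken to be the average over nodes of $Q_{il}(x) Q_{il'}(y)$, which is an assumption of this example (it follows from $Q$ factorizing across attributes within a node); the function name is also hypothetical.

```python
import math

def mutual_information(phi, l, lp):
    """MI between attributes l and lp, computed from variational parameters phi[i][l]."""
    n = len(phi)

    def q(i, a, x):
        # Q_ia(x): probability under Q that attribute a of node i equals x
        return phi[i][a] if x == 1 else 1.0 - phi[i][a]

    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p_x = sum(q(i, l, x) for i in range(n)) / n
            p_y = sum(q(i, lp, y) for i in range(n)) / n
            # Assumed joint: average over nodes of the product of the two factors.
            p_xy = sum(q(i, l, x) * q(i, lp, y) for i in range(n)) / n
            if p_xy > 0.0:
                mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# Perfectly correlated attributes: MI equals log 2.
mi_correlated = mutual_information([[1.0, 1.0], [0.0, 0.0]], 0, 1)
# Every node has the same phi: the attributes look independent, MI is 0.
mi_independent = mutual_information([[0.3, 0.7]] * 4, 0, 1)
```

The cost is $O(N)$ per attribute pair, so evaluating all pairs takes $O(L^2 N)$ time.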



The MagFit algorithm. To solve the regularized MagFit problem in Eq. (8), we use the EM algorithm, which maximizes the lower bound $\mathcal{L}_Q(\mu, \Theta)$ regularized by the mutual information. In the E-step, we reduce the gap between the original likelihood $\mathcal{L}(\mu, \Theta)$ and its lower bound $\mathcal{L}_Q(\mu, \Theta)$ as well as minimize the mutual information between pairs of attributes. By fixing the model parameters $\mu$ and $\Theta$, we update $\phi_{il}$ one by one using a gradient-based method. In the M-step, we then maximize $\mathcal{L}_Q(\mu, \Theta)$ by updating the model parameters $\mu$ and $\Theta$. We repeat E- and M-steps until all the parameters $\phi$, $\mu$, and $\Theta$ converge. Next we briefly overview the E- and the M-step. We give further details in the Appendix.

Variational E-Step. In the E-step, we consider model parameters $\mu$ and $\Theta$ as given and we aim to find the values of variational parameters $\phi$ that maximize $\mathcal{L}_Q(\mu, \Theta)$ as well as minimize the mutual information $MI(F)$. We use the stochastic gradient method to update variational parameters $\phi$. We randomly select a batch of entries in $\phi$ and update them by their gradient values of the objective function in Eq. (8). We repeat this procedure until parameters $\phi$ converge.

Algorithm 1 MagFit-VarEStep($A$, $\mu$, $\Theta$)
  Initialize $\phi^{(0)} = \{\phi_{il} : i = 1, \ldots, N,\ l = 1, \ldots, L\}$
  for $t \leftarrow 0$ to $T - 1$ do
    Select $S \subset \phi^{(t)}$ with $|S| = B$
    for $\phi_{il}^{(t)} \in S$ do
      Compute $\partial \mathcal{L}_Q / \partial \phi_{il}$ and $\partial MI / \partial \phi_{il}$
      Update $\phi_{il}$ using $\partial \mathcal{L}_Q / \partial \phi_{il} - \lambda\, \partial MI / \partial \phi_{il}$
    end for
  end for

Algorithm 2 MagFit-VarMStep($\phi$, $G$, $\Theta^{(0)}$)
  for $l \leftarrow 1$ to $L$ do
    $\mu_l \leftarrow \frac{1}{N} \sum_i \phi_{il}$
  end for
  for $t \leftarrow 0$ to $T - 1$ do
    for $l \leftarrow 1$ to $L$ do
      Update $\Theta_l^{(t)}$ using the gradient $\nabla_{\Theta_l} \mathcal{L}_\Theta$
    end for
  end for



First, by computing $\partial \mathcal{L}_Q / \partial \phi_{il}$ and $\partial MI / \partial \phi_{il}$ we obtain the gradient $\nabla_\phi \left(\mathcal{L}_Q(\mu, \Theta) - \lambda\, MI(F)\right)$ (see Appendix for details). Then we choose a batch of $\phi_{il}$ at random and update them by $\partial \mathcal{L}_Q / \partial \phi_{il} - \lambda\, \partial MI / \partial \phi_{il}$ in each step. The mutual information regularization term typically works in the opposite direction of the likelihood. Intuitively, the regularization prevents the solution from being stuck in a local optimum where the node attributes are correlated. Algorithm 1 gives the pseudocode.

Variational M-Step. In the E-step, we introduced the variational distribution $Q(F)$ parameterized by $\phi$ and approximated the posterior distribution $P(F|A, \mu, \Theta)$ by maximizing $\mathcal{L}_Q(\mu, \Theta)$ over $\phi$. In the M-step, we now fix $Q(F)$, i.e., fix the variational parameters $\phi$, and update the model parameters $\mu$ and $\Theta$ to maximize $\mathcal{L}_Q$.

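First, in order to maximize $\mathcal{L}_Q(\mu, \Theta)$ with respect to $\mu$, we need to maximize $\mathcal{L}_{\mu_l} = \sum_i \mathbb{E}_{Q_{il}}\left[\log P(F_{il}|\mu_l)\right]$ for each $\mu_l$. By the definitions in Eqs. (4) and (7), we obtain

$$\mathcal{L}_{\mu_l} = \sum_i \left(\phi_{il} \log \mu_l + (1 - \phi_{il}) \log(1 - \mu_l)\right).$$

Then $\mathcal{L}_{\mu_l}$ is maximized when

$$\frac{\partial \mathcal{L}_{\mu_l}}{\partial \mu_l} = \sum_i \left(\frac{\phi_{il}}{\mu_l} - \frac{1 - \phi_{il}}{1 - \mu_l}\right) = 0,$$

where $\mu_l = \frac{1}{N} \sum_i \phi_{il}$.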
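The closed-form update for $\mu_l$ is simply the column mean of the variational parameters. The sketch below shows it together with the objective it maximizes; the function names and the toy values of `phi` are assumptions made for the example.

```python
import math

def update_mu(phi):
    """Closed-form M-step update: mu_l is the mean of phi[i][l] over nodes i."""
    n = len(phi)
    num_attrs = len(phi[0])
    return [sum(row[l] for row in phi) / n for l in range(num_attrs)]

def expected_log_prior(phi, l, mu_l):
    """sum_i E_Q[log P(F_il | mu_l)], the quantity maximized over mu_l."""
    return sum(row[l] * math.log(mu_l) + (1.0 - row[l]) * math.log(1.0 - mu_l)
               for row in phi)

# Hypothetical variational parameters for N = 3 nodes and L = 2 attributes.
phi = [[0.2, 0.9],
       [0.4, 0.7],
       [0.6, 0.5]]
mus = update_mu(phi)  # column means: approximately [0.4, 0.7]
```

Because the objective is concave in $\mu_l$, the stationary point is the global maximum, so no iterative optimization is needed for this part of the M-step.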

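Second, to maximize $\mathcal{L}_Q(\mu, \Theta)$ with respect to $\Theta_l$, we maximize $\mathcal{L}_\Theta = \mathbb{E}_Q\left[\log P(A, F|\mu, \Theta) - \log Q(F)\right]$. We first obtain the gradient

$$\nabla_{\Theta_l} \mathcal{L}_\Theta = \sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[\log P(A_{ij}|F_i, F_j, \Theta)\right] \qquad (10)$$

and then use a gradient-based method to optimize $\mathcal{L}_Q(\mu, \Theta)$ with regard to $\Theta_l$. Algorithm 2 gives details for optimizing $\mathcal{L}_Q(\mu, \Theta)$ over $\mu$ and $\Theta$.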

Speeding up MagFit. So far we described how to apply the variational EM algorithm to MAG model parameter estimation. However, both the E-step and the M-step are infeasible when the number of nodes $N$ is large. In particular, in the E-step, for each update of $\phi_{il}$, we have to compute the expected log-likelihood value of every entry in the $i$-th row and column of the adjacency matrix $A$. It takes $O(LN)$ time to do this, so overall $O(L^2 N^2)$ time is needed to update all $\phi_{il}$. Similarly, in the M-step, we need to sum up the gradient with respect to $\Theta_l$ over every pair of nodes (as in Eq. (10)). Therefore, the M-step requires $O(LN^2)$ time and so it takes $O(L^2 N^2)$ to run a single iteration of EM. Quadratic dependency on the number of attributes $L$ and the number of nodes $N$ is infeasible for the size of the networks that we aim to work with here.

To tackle this, we make the following observation. Note that both Eq. (10) and the computation of $\partial \mathcal{L}_Q / \partial \phi_{il}$ involve the sum of expected values of the log-likelihood or the gradient. If we can quickly approximate this sum of the expectations, we can dramatically reduce the computation time. As real-world networks are sparse, in the sense that most of the edges do not exist in the network, we can break the summation into two parts: a fixed part that "pretends" that the network has no edges and an adjustment part that takes into account the edges that actually exist in the network.

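For example, in the M-step we can separate Eq. (10) into two parts, the first term that considers an empty graph and the second term that accounts for the edges that actually occurred in the network:

$$\begin{aligned}
\nabla_{\Theta_l} \mathcal{L}_\Theta = &\sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[\log P(0|F_i, F_j, \Theta)\right] \\
&+ \sum_{A_{ij}=1} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[\log P(1|F_i, F_j, \Theta) - \log P(0|F_i, F_j, \Theta)\right]. \qquad (11)
\end{aligned}$$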
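The same split applies to any sum over node pairs, and it can be verified on a small example. The sketch below applies it to the log-likelihood for a fixed matrix of link probabilities. It computes the empty-graph term exactly in $O(N^2)$ just to check the identity, whereas the method described here approximates that term; the function names and all numeric values are hypothetical.

```python
import math

def loglik_dense(adj, probs):
    """Direct sum over all ordered pairs i != j."""
    n = len(adj)
    return sum(math.log(probs[i][j]) if adj[i][j] else math.log(1.0 - probs[i][j])
               for i in range(n) for j in range(n) if i != j)

def loglik_split(edges, probs):
    """Empty-graph term plus a correction summed over observed edges only."""
    n = len(probs)
    empty_term = sum(math.log(1.0 - probs[i][j])
                     for i in range(n) for j in range(n) if i != j)
    edge_term = sum(math.log(probs[i][j]) - math.log(1.0 - probs[i][j])
                    for i, j in edges)
    return empty_term + edge_term

# Hypothetical link probabilities and a sparse toy graph with two edges.
probs = [[0.0, 0.3, 0.1],
         [0.2, 0.0, 0.4],
         [0.5, 0.6, 0.0]]
edges = [(0, 1), (2, 0)]
adj = [[0] * 3 for _ in range(3)]
for i, j in edges:
    adj[i][j] = 1

dense = loglik_dense(adj, probs)
split = loglik_split(edges, probs)
```

Only the correction term depends on the observed edges, so once the empty-graph term is available cheaply, the remaining work scales with the number of edges $E$ rather than with $N^2$.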



Now we approximate the first term that computes the gradient pretending that the graph $A$ has no edges:

$$\begin{aligned}
\sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[\log P(0|F_i, F_j, \Theta)\right]
&\approx \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[N(N-1)\, \mathbb{E}_F\left[\log P(0|F, \Theta)\right]\right] \\
&= \nabla_{\Theta_l} N(N-1)\, \mathbb{E}_F\left[\log P(0|F, \Theta)\right]. \qquad (12)
\end{aligned}$$

Since each $F_{il}$ follows the Bernoulli distribution with parameter $\mu_l$, Eq. (12) can be computed in $O(L)$ time. As the second term in Eq. (11) requires only $O(LE)$ time, the computation time of the M-step is reduced from $O(LN^2)$ to $O(LE)$. Similarly we reduce the computation time of the E-step from $O(L^2 N^2)$ to $O(L^2 E)$ (see Appendix for details). Thus overall we reduce the computation time of MagFit from $O(L^2 N^2)$ to $O(L^2 E)$.

4 Experiments 

Having introduced the MAG model estimation procedure MagFit, we now turn our attention to evaluating the fitting procedure itself and the ability of the MAG model to capture the connectivity structure of real networks. There are three goals of our experiments: (1) evaluate the success of the MagFit parameter estimation procedure; (2) given a network, infer both latent node attributes and the affinity matrices to accurately model the network structure; (3) given a network where nodes already have attributes, infer the affinity matrices. For each experiment, we proceed by describing the experimental setup and datasets.

Convergence of MagFit. First, we briefly evaluate the convergence of the MagFit algorithm. For this experiment, we use synthetic MAG networks with $N = 1024$ and $L = 4$. Figure 3(a) illustrates that the objective function $\mathcal{L}_Q$, i.e., the lower bound of the log-likelihood, converges with the number of EM iterations. While the log-likelihood converges, the model parameters $\mu$ and $\Theta$ also converge. Figure 3(b) shows the convergence of $\mu_1, \ldots, \mu_4$, while Figure 3(c) shows the convergence of entries $\Theta_l[0, 0]$ for $l = 1, \ldots, 4$. Generally, in 100 iterations of EM, we obtain stable parameter estimates.

We also compare the runtime of the fast MagFit to the naive version where we do not use speedups for the algorithm. Figure 3(d) shows the runtime as a function of the number of nodes in the network. The runtime of the naive algorithm scales quadratically, $O(N^2)$, while the fast version runs in near-linear time. For example, on a 4,000 node network, the fast algorithm runs about 100 times faster than the naive one.

Figure 3: Parameter convergence and scalability.

Based on these experiments, we conclude that the variational EM gives robust parameter estimates. We note that the MagFit optimization problem is non-convex; however, in practice we observe fast convergence and good fits. Depending on the initialization, MagFit may converge to different solutions, but in practice solutions tend to have comparable log-likelihoods and consistently good fits. Also, the method scales to networks with up to a hundred thousand nodes.

Experiments on real data. We proceed with experiments on real datasets. We use the LinkedIn social network [8] at the time in its evolution when it had $N = 4,096$ nodes and $E = 10,052$ edges. We also use the Yahoo!-Answers question answering social network, again from the time when the network had $N = 4,096$, $E = 5,678$ [5]. For our experiments we choose $L = 11$, which is roughly $\log N$, as it has been shown that this is the optimal choice for $L$ [5].

Now we proceed as follows. Given a real network $A$, we apply MagFit to estimate MAG model parameters $\Theta$ and $\mu$. Then, given these parameters, we generate a synthetic network $\hat{A}$ and compare how well the synthetic $\hat{A}$ mimics the real network $A$.

Evaluation. To measure the level of agreement between the synthetic $\hat{A}$ and the real $A$, we use several different metrics. First, we evaluate how well $\hat{A}$ captures the structural properties, like degree distribution and clustering coefficient, of the real network $A$. We consider the following network properties:

• In/Out-degree distribution (InD/OutD) is a histogram of the number of in-coming and out-going links of a node.

• Singular values (SVal) indicate the singular values of the adjacency matrix versus their rank.

• Singular vector (SVec) represents the distribution of components in the left singular vector associated with the largest singular value.

• Clustering coefficient (CCF) represents the degree versus the average (local) clustering coefficient of nodes of a given degree [TH].

• Triad participation (TP) indicates the number of triangles that a node is adjacent to. It measures the transitivity in networks.

Since distributions of the above quantities are generally heavy-tailed, we plot them in terms of complementary cumulative distribution functions ($P(X > x)$ as a function of $x$). Also, to indicate the scale, we do not normalize the distributions to sum to 1.

Second, to quantify the discrepancy of network properties between real and synthetic networks, we use a variant of the Kolmogorov-Smirnov (KS) statistic and the L2 distance between different distributions. The original KS statistic is not appropriate here because, if the distribution follows a power law, the original KS statistic is usually dominated by the head of the distribution. We thus consider the following variant of the KS statistic: $KS(D_1, D_2) = \max_x |\log D_1(x) - \log D_2(x)|$ [6], where $D_1$ and $D_2$ are two complementary cumulative distribution functions. Similarly, we also define a variant of the L2 distance on the log-log scale,

$$L2(D_1, D_2) = \sqrt{\frac{1}{\log b - \log a} \int_a^b \left(\log D_1(x) - \log D_2(x)\right)^2 d(\log x)},$$

where $[a, b]$ is the support of distributions $D_1$ and $D_2$. Therefore, we evaluate the performance with regard to the recovery of the network properties in terms of the KS and L2 statistics.
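The log-scale KS variant can be sketched as below. The example represents each complementary cumulative distribution as a dictionary from $x$ to $D(x)$ and evaluates the maximum only at points where both functions are defined and positive; that representation, the restriction to shared points, and the function name `log_ks` are assumptions of this sketch.

```python
import math

def log_ks(ccdf1, ccdf2):
    """max over shared x of |log D1(x) - log D2(x)|."""
    shared = [x for x in ccdf1
              if x in ccdf2 and ccdf1[x] > 0 and ccdf2[x] > 0]
    return max(abs(math.log(ccdf1[x]) - math.log(ccdf2[x])) for x in shared)

# Hypothetical unnormalized CCDFs (counts of nodes with value greater than x).
d_real = {1: 100, 2: 50, 4: 10}
d_synth = {1: 100, 2: 25, 4: 10}
ks = log_ks(d_real, d_synth)  # largest gap is at x = 2
```

Taking logarithms makes a factor-of-two mismatch in the tail count as much as a factor-of-two mismatch in the head, which is the motivation given above for not using the original statistic.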

Last, since MAG generates a probabilistic adjacency matrix $P$, we also evaluate how well $P$ represents a given network $A$. We use the following two metrics:

• Log-likelihood (LL) measures how likely it is that the probabilistic adjacency matrix $P$ generates network $A$: $LL = \sum_{ij} \log\left(P_{ij}^{A_{ij}} (1 - P_{ij})^{1 - A_{ij}}\right)$.

• True Positive Rate Improvement (TPI) represents the improvement of the true positive rate over a random graph: $TPI = \sum_{A_{ij}=1} P_{ij} \,/\, \frac{E^2}{N^2}$. TPI indicates how much more probability mass is put on the edges compared to a random graph (where each edge occurs with probability $E/N^2$).
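Both metrics can be sketched directly from a probabilistic adjacency matrix. The example reads TPI as the average probability placed on true edges divided by the random-graph edge probability $E/N^2$; that reading, the exclusion of the diagonal from LL, and the toy values are assumptions of this sketch.

```python
import math

def log_likelihood(adj, probs):
    """LL: sum of log P_ij over edges and log(1 - P_ij) over non-edges, i != j."""
    n = len(adj)
    return sum(math.log(probs[i][j]) if adj[i][j] else math.log(1.0 - probs[i][j])
               for i in range(n) for j in range(n) if i != j)

def tpi(adj, probs):
    """Average probability on true edges, relative to the baseline E / N^2."""
    n = len(adj)
    edges = [(i, j) for i in range(n) for j in range(n) if adj[i][j]]
    e = len(edges)
    mass_on_edges = sum(probs[i][j] for i, j in edges)
    return (mass_on_edges / e) / (e / n ** 2)

# Hypothetical two-node example with a single edge 0 -> 1.
adj = [[0, 1],
       [0, 0]]
probs = [[0.0, 0.8],
         [0.2, 0.0]]
ll = log_likelihood(adj, probs)  # log(0.8) + log(1 - 0.2)
score = tpi(adj, probs)          # (0.8 / 1) / (1 / 4)
```

A TPI of 1 would mean the model places no more probability on the observed edges than a uniform random graph with the same density.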

Recovery of the network structure. We begin our investigations of real networks by comparing the performance of the MAG model to that of the Kronecker graphs model [S], which offers a state-of-the-art baseline for modeling the structure of large networks. We use the evaluation methods described in the previous section, where we fit both models to a given real-world network $A$ and generate synthetic $\hat{A}_{MAG}$ and $\hat{A}_{Kron}$. Then we compute the structural properties of all three networks and plot them in Figure 4. Moreover, for each of the properties we also compute KS and L2 statistics and show them in Table 1.

Figure 4: The recovered network properties by the MAG model and the Kronecker graphs model on the LinkedIn network. For every network property, the MAG model outperforms the Kronecker graphs model.

Figure 4 plots the six network properties described above for the LinkedIn network and the synthetic networks generated by fitting MAG and Kronecker models to the LinkedIn network. We observe that MAG can successfully produce synthetic networks that match the properties of the real network. In particular, both MAG and Kronecker graphs models capture the degree distribution of the LinkedIn network well. However, the MAG model performs much better in matching spectral properties of the graph adjacency matrix as well as the local clustering of the edges in the network.

Table 1 shows the KS and L2 statistics for each of the six structural properties plotted in Figure 4. The results confirm our previous visual inspection. The MAG model is able to fit the network structure much better than the Kronecker graphs model. In terms of the average KS statistic, we observe a 43% improvement, while we observe an even greater improvement of 70% in the L2 metric. For degree distributions and the singular values, MAG outperforms Kronecker by about 25%, while the improvement on singular vector, triad participation and clustering coefficient is 60 ~ 75%.

Table 1: KS and L2 of MAG and the Kronecker graphs model on the LinkedIn network. MAG exhibits 50-70% better performance than the Kronecker graphs model.

Table 2: LL and TPI values for LinkedIn (LI) and Yahoo!-Answers (YA) networks.

Figure 5: Structures in which a node attribute can affect link affinity. The widths of arrows correspond to the affinities towards link formation.

We make similar observations on the Yahoo!-Answers network but omit the results for brevity. We include them in the Appendix.

We interpret the improvement of the MAG over the Kronecker graphs model in the following way. Intuitively, we can think of the Kronecker graphs model as a version of the MAG model where all affinity matrices $\Theta_l$ are the same and all $\mu_l = 0.5$. However, real-world networks may include various types of structures and thus different attributes may interact in different ways. For example, Figure 5 shows three possible linking affinities of a binary attribute. Figure 5(a) shows a homophily (love of the same) attribute affinity and the corresponding affinity matrix $\Theta$. Notice the large values on the diagonal entries of $\Theta$, which mean that link probability is high when nodes share the same attribute value. The top of each figure demonstrates that there will be many links between nodes that have the value of the attribute set to "0" and many links between nodes that have the value "1", but there will be few links between nodes where one has value "0" and the other "1". Similarly, Figure 5(b) shows a heterophily (love of the different) affinity, where nodes that do not share the value of the attribute are more likely to link, which gives rise to near-bipartite networks. Last, Figure 5(c) shows a core-periphery affinity, where links are most likely to form between "0" nodes (i.e., members of the core) and least likely to form between "1" nodes (i.e., members of the periphery). Notice that links between the core and the periphery are more likely than the links between the nodes of the periphery.

Turning our attention back to the MAG and Kronecker
models, we note that real-world networks globally
exhibit nested core-periphery structure [11] (Figure 5(c)).
While there exists a core (densely connected) and
a periphery (sparsely connected) part of the network,
there is another level of core-periphery structure
inside the core itself. On the other hand, when
viewing the network at a finer scale, we may also
observe homophily, which produces local community
structure. MAG can model both global core-periphery
structure and local homophily communities, while the
Kronecker graphs model cannot express the different
affinity types because it uses only one initiator matrix.
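The three affinity patterns discussed above can be told apart mechanically from the entries of a 2x2 affinity matrix. The following is a minimal sketch, not part of MagFit: the function name `classify_affinity`, the strict-inequality rules, and the example matrices are assumptions made for illustration.

```python
def classify_affinity(theta):
    """Label a 2x2 affinity matrix by the pattern of its entries.

    theta[a][b] is the affinity for a link from a node whose attribute
    value is a to a node whose attribute value is b.
    """
    (t00, t01), (t10, t11) = theta
    diag_lo, diag_hi = min(t00, t11), max(t00, t11)
    off_lo, off_hi = min(t01, t10), max(t01, t10)
    if diag_lo > off_hi:
        # Same-value pairs always beat mixed pairs.
        return "homophily"
    if off_lo > diag_hi:
        # Mixed pairs always beat same-value pairs.
        return "heterophily"
    if t00 > off_hi and off_lo > t11:
        # core-core > core-periphery > periphery-periphery
        return "core-periphery (core = 0)"
    if t11 > off_hi and off_lo > t00:
        return "core-periphery (core = 1)"
    return "mixed"


# Illustrative matrices (hypothetical values, one per pattern).
examples = {
    "homophily": [[0.9, 0.1], [0.1, 0.8]],
    "heterophily": [[0.2, 0.9], [0.9, 0.1]],
    "core-periphery": [[0.9, 0.5], [0.5, 0.2]],
}
for name, theta in examples.items():
    print(name, "->", classify_affinity(theta))
```

A single Kronecker initiator matrix would receive exactly one of these labels, whereas a MAG model with several attributes can mix them, one label per affinity matrix.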

For example, the model fitted to the LinkedIn network
consists of 4 core-periphery affinity matrices, 6 homophily
affinity matrices, and 1 heterophily affinity matrix.
Core-periphery affinities model active users who are more
likely to connect to others. Homophily affinities model
people who are more likely to connect to others in the
same job area. Interestingly, there is a heterophily
affinity which results in a bipartite relationship. We
believe that the relationships between job seekers and
recruiters or between employers and employees lead to
this structure.

TPI and LL. We also compare the LL and TPI values
of the MAG and Kronecker models on both the LinkedIn
and Yahoo!-Answers networks. Table 2 shows that
MAG outperforms Kronecker graphs by a surprisingly
large margin. In the LL metric, the MAG model shows
a 50-60% improvement over the Kronecker model.
Furthermore, in the TPI metric, the MAG model shows
23-35 times better accuracy than the Kronecker
model. From these results, we conclude that the MAG
model achieves a superior probabilistic representation
of a given network.

Case Study: AddHealth network. So far we considered
node attributes as latent and we inferred the
affinity matrices $\Theta$ as well as the attributes themselves.
Now, we consider the setting where the node attributes
are already given and we only need to infer the affinities $\Theta$.
Our goal here is to study how real attributes explain
the underlying network structure.

We use the largest high-school friendship network
(N = 457, E = 2,259) from the National Longitudinal
Study of Adolescent Health (AddHealth) dataset. The
dataset includes more than 70 school-related attributes
for each student. Since some attributes do not take
binary values, we binarize them by setting the value to 1
if the attribute is less than its median value.
Now we aim to investigate which attributes affect
friendship formation and how.

Figure 6: Properties of the AddHealth network.

We set L = 7 and consider the following methods for 
selecting a subset of 7 attributes: 

• R7: Randomly choose 7 real attributes and fit the
model (i.e., only fit $\Theta$, as the attributes are given).

• L7: Regard all 7 attributes as latent (i.e., not
given) and estimate $\mu_l$ and $\Theta_l$ for $l = 1, \ldots, 7$.

• F7: Forward selection. Select attributes one by
one. At each step select an additional attribute
that maximizes the overall log-likelihood (i.e.,
select a real attribute and estimate its $\Theta_l$).

• F5+L2: Select 5 real attributes using forward
selection. Then, we infer 2 more latent attributes.

To make MagFit work with fixed real attributes
(i.e., only infer $\Theta$), we fix $\phi_{il}$ to the values of the
real attributes. In the E-step we then optimize only over the
latent set of $\phi_{il}$, and the M-step remains as is.
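The forward-selection procedure (F7) described in the list above is a greedy loop around a model fit. The sketch below shows only that loop. The callback `score`, the attribute names, and the toy gains are hypothetical stand-ins; in the actual setting `score` would fit the affinity matrices for the chosen real attributes and return the log-likelihood.

```python
def forward_selection(candidates, score, k):
    """Greedily pick k attributes, adding the best-scoring one each step.

    `score(selected)` is a hypothetical callback returning the fitted
    log-likelihood for a list of attribute names.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Try each unused attribute on top of the current selection.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for the model fit: each attribute has a fixed gain, and
# "grade" and "math" are redundant, so choosing both is worth less.
GAIN = {"grade": 5.0, "math": 4.0, "gpa": 3.0, "sport": 0.5}


def toy_score(attrs):
    total = sum(GAIN[a] for a in attrs)
    if "grade" in attrs and "math" in attrs:
        total -= 3.0
    return total


print(forward_selection(GAIN, toy_score, 3))
```

Because each step conditions on what is already selected, a redundant attribute can be passed over in favor of a weaker but complementary one, which is the behavior one would hope for from F7.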

AddHealth network structure. We begin by evaluating
the recovery of the network structure. Figure 6
shows the recovery of six network properties for each
attribute selection method. We note that each method
manages to recover degree distributions as well as
spectral properties (singular values and singular vectors),
but the performance differs for clustering coefficient
and triad participation.

Table 3 shows the discrepancies in the 6 network
properties (KS and L2 statistics) for each attribute
selection method. As expected, selecting 7 real attributes
at random (R7) performs the worst. Naturally, L7
performs the best (23% improvement over R7 in KS and
50% in L2) as it has the most degrees of freedom. It
is followed by F5+L2 (the combination of 5 real and 2
latent attributes) and F7 (forward selection).

Table 4: LL and TPI for the AddHealth network.

As a point of comparison we also experimented with
a simple logistic regression classifier where, given the
attributes of a pair of nodes, we aim to predict the
occurrence of an edge. That is, given a network $A$ on
$N$ nodes, we have $N^2$ training examples (one for each
pair of nodes): $E$ are positive (edges) and $N^2 - E$ are
negative (non-edges). However, the model performs
poorly, giving roughly 40-50% worse KS statistics than MAG.
The average KS of logistic regression under R7 is 3.24
(vs. 2.32 of MAG) and the same statistic under F7
is 3.00 (vs. 2.05 of MAG). Similarly, logistic regression
gives 40% worse L2 under R7 and 50% worse
L2 under F7. These results demonstrate that, using
the same attributes, MAG heavily outperforms logistic
regression. We attribute this performance difference
to the fact that the connectivity between a pair
of nodes depends on factors other than a linear
combination of their attribute values.

Last, we also examine the LL and TPI values and
compare them to the random attribute selection R7
as a baseline. Table 4 gives the results. Somewhat
contrary to our previous observations, we note that
F7 only slightly outperforms R7, while F5+L2 gives a
factor of 2 better TPI than R7. Again, L7 gives a factor
of 10 improvement in TPI and the best overall performance.

Attribute affinities. Last, we investigate the structure
of the attribute affinity matrices to illustrate how
the MAG model can be used to understand the way real
attributes interact in shaping the network structure.
We use forward selection (F7) to select 7 real attributes
and estimate their affinity matrices. Table 5 reports
the first 5 attributes selected by forward selection.



Table 5: Affinity matrices of 5 AddHealth attributes.

Affinity matrix               Attribute description
[0.572 0.146; 0.146 0.999]    School year (0 if > 2)
[0.845 0.332; 0.332 0.816]    Highest level math (0 if > 6)
[0.788 0.377; 0.377 0.784]    Cumulative GPA (0 if > 2.65)
[0.999 0.246; 0.246 0.352]    AP/IB English (0 if taken)
[0.794 0.407; 0.407 0.717]    Foreign language (0 if taken)



First, notice that the AddHealth network is an undirected
graph and that the estimated affinity matrices are all
symmetric. This means that, without a priori biasing
the fitting towards undirected graphs, the recovered
parameters obey this structure. Second, we also
observe that every attribute forms a homophily structure,
in the sense that each student is more likely to be
friends with other students of the same characteristic.
For example, people are more likely to make friends
of the same school year. Interestingly, students who
are freshmen or sophomores are more likely (0.99) to
form links among themselves than juniors and seniors
(0.57). Also notice that the level of advanced courses
that each student takes, as well as the GPA, affects the
formation of friendship ties. Since it is difficult for
students to interact if they do not take the same courses,
the chance of friendship between them may be low. We note
that, for example, students who take advanced placement
(AP) English courses are very likely to form
links. However, links between students who did not
take AP English are nearly as likely as links between
AP and non-AP students. Last, we also observe a
relatively small effect of the number of foreign language
courses taken on friendship formation.

5 Conclusion 

We developed MagFit, a scalable variational
expectation maximization method for parameter estimation
of the Multiplicative Attribute Graph model. The
model naturally captures interactions between node
attributes and the network structure. The MAG model
considers nodes with categorical attributes, and the
probability of an edge between a pair of nodes depends
on the product of individual attribute link formation
affinities. Experiments show that MAG reliably
captures the network connectivity patterns and
provides insights into how different attributes shape the
structure of networks. Avenues for future work include
settings where node attributes are partially missing
and investigations of other ways to combine individual
attribute linking affinities into a link probability.

A Variational EM Algorithm

In Section 2, we proposed a version of the MAG model by
introducing a generative Bernoulli model for node
attributes and formulated the problem to solve. In
Section 3, we gave a sketch of MagFit, which
uses the variational EM algorithm to solve the problem.
Here we show how to compute the gradients of
the model parameters ($\phi$, $\mu$, and $\Theta$) for the E-step
and M-step, which we omitted in Section 3. We also give
the details of the fast MagFit.

A.1 Variational E-Step

In the E-step, the MAG model parameters $\mu$ and $\Theta$
are given and we aim to find the optimal variational
parameter $\phi$ that maximizes $\mathcal{L}_Q(\mu, \Theta)$ as well as
minimizes the mutual information factor $\mathrm{MI}(F)$. We
randomly select a batch of entries in $\phi$ and update the
selected entries by their gradient values of the objective
function $\mathcal{L}_Q(\mu, \Theta)$. We repeat this updating procedure
until $\phi$ converges.

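In order to obtain $\nabla_{\phi} \left( \mathcal{L}_Q(\mu, \Theta) - \lambda\, \mathrm{MI}(F) \right)$, we
compute $\partial \mathcal{L}_Q / \partial \phi_{il}$ and $\partial \mathrm{MI} / \partial \phi_{il}$ in turn as follows.

Computation of $\partial \mathcal{L}_Q / \partial \phi_{il}$. To calculate the partial
derivative, we begin by restating $\mathcal{L}_Q(\mu, \Theta)$ as a
function of one specific parameter $\phi_{il}$ and differentiate
this function over $\phi_{il}$. For convenience, we denote
$F_{-il} = \{F_{jk} : j \neq i, k \neq l\}$ and $Q_{-il} = \prod_{j \neq i, k \neq l} Q_{jk}$.
Note that $\sum_{F_{il}} Q_{il}(F_{il}) = 1$ and $\sum_{F_{-il}} Q_{-il}(F_{-il}) = 1$
because both are the sums of probabilities of all possible
events. Therefore, we can separate $\mathcal{L}_Q(\mu, \Theta)$
into the terms of $Q_{il}(F_{il})$ and $Q_{-il}(F_{-il})$:

$$
\begin{aligned}
\mathcal{L}_Q(\mu, \Theta)
&= \mathbb{E}_Q \left[ \log P(A, F \mid \mu, \Theta) - \log Q(F) \right] \\
&= \sum_{F} Q(F) \left( \log P(A, F \mid \mu, \Theta) - \log Q(F) \right) \\
&= \sum_{F_{-il}} \sum_{F_{il}} Q_{-il}(F_{-il})\, Q_{il}(F_{il})
   \left( \log P(A, F \mid \mu, \Theta) - \log Q_{il}(F_{il}) - \log Q_{-il}(F_{-il}) \right) \\
&= \sum_{F_{il}} Q_{il}(F_{il}) \left( \sum_{F_{-il}} Q_{-il}(F_{-il}) \log P(A, F \mid \mu, \Theta) \right)
   - \sum_{F_{il}} Q_{il}(F_{il}) \log Q_{il}(F_{il})
   - \sum_{F_{-il}} Q_{-il}(F_{-il}) \log Q_{-il}(F_{-il}) \\
&= \sum_{F_{il}} Q_{il}(F_{il})\, \mathbb{E}_{Q_{-il}} \left[ \log P(A, F \mid \mu, \Theta) \right]
   + \mathcal{H}(Q_{il}) + \mathcal{H}(Q_{-il})
\end{aligned}
\tag{13}
$$

where $\mathcal{H}(P)$ represents the entropy of distribution $P$.

Since we compute the gradient of $\phi_{il}$, we regard
the other variational parameter $\phi_{-il}$ as a constant,
so $\mathcal{H}(Q_{-il})$ is also a constant. Moreover, as
$\mathbb{E}_{Q_{-il}} [\log P(A, F \mid \mu, \Theta)]$ integrates out all the terms
with regard to $\phi_{-il}$, it is a function of $F_{il}$. Thus,
for convenience, we denote $\mathbb{E}_{Q_{-il}} [\log P(A, F \mid \mu, \Theta)]$ as
$\log \tilde{P}_{il}(F_{il})$. Then, since $F_{il}$ follows a Bernoulli
distribution with parameter $\phi_{il}$, by Eq. (13)

$$
\mathcal{L}_Q(\mu, \Theta) = (1 - \phi_{il}) \left( \log \tilde{P}_{il}(1) - \log (1 - \phi_{il}) \right)
+ \phi_{il} \left( \log \tilde{P}_{il}(0) - \log \phi_{il} \right) + \text{const}.
\tag{14}
$$

Note that both $\tilde{P}_{il}(0)$ and $\tilde{P}_{il}(1)$ are constant. Therefore,

$$
\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}
= \log \frac{\tilde{P}_{il}(0)}{\phi_{il}} - \log \frac{\tilde{P}_{il}(1)}{1 - \phi_{il}}.
\tag{15}
$$

To complete the computation of $\partial \mathcal{L}_Q / \partial \phi_{il}$, now we focus on
the value of $\tilde{P}_{il}(F_{il})$ for $F_{il} = 0, 1$. By the factorization of
$P(A, F \mid \mu, \Theta)$ and the linearity of expectation,
$\log \tilde{P}_{il}(F_{il})$ is separable into small tractable terms as follows:

$$
\begin{aligned}
\log \tilde{P}_{il}(F_{il})
&= \mathbb{E}_{Q_{-il}} \left[ \log P(A, F \mid \mu, \Theta) \right] \\
&= \sum_{u,v} \mathbb{E}_{Q_{-il}} \left[ \log P(A_{uv} \mid F_u, F_v, \Theta) \right]
 + \sum_{u,k} \mathbb{E}_{Q_{-il}} \left[ \log P(F_{uk} \mid \mu_k) \right]
\end{aligned}
\tag{16}
$$

where $F_i = \{F_{il} : l = 1, 2, \ldots, L\}$. However, if
$u, v \neq i$, then $\mathbb{E}_{Q_{-il}} [\log P(A_{uv} \mid F_u, F_v, \Theta)]$ is a
constant, because the average over $Q_{-il}(F_{-il})$ integrates
out all the variables $F_u$ and $F_v$. Similarly, if $u \neq i$
and $k \neq l$, then $\mathbb{E}_{Q_{-il}} [\log P(F_{uk} \mid \mu_k)]$ is a constant.
Since most of the terms in Eq. (16) are irrelevant to $\phi_{il}$,
$\log \tilde{P}_{il}(F_{il})$ is simplified as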

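$$
\log \tilde{P}_{il}(F_{il})
= \sum_{j} \mathbb{E}_{Q_{-il}} \left[ \log P(A_{ij} \mid F_i, F_j, \Theta) \right]
+ \sum_{j} \mathbb{E}_{Q_{-il}} \left[ \log P(A_{ji} \mid F_j, F_i, \Theta) \right]
+ \log P(F_{il} \mid \mu_l) + C
\tag{17}
$$

for some constant $C$.

By the definition of $P(F_{il} \mid \mu_l)$, the last term in
Eq. (17) is

$$
\log P(F_{il} \mid \mu_l) = F_{il} \log \mu_l + (1 - F_{il}) \log (1 - \mu_l).
\tag{18}
$$

With regard to the first two terms in Eq. (17),

$$
\log P(A_{ij} \mid F_i, F_j, \Theta) = \log P(A_{ji} \mid F_j, F_i, \Theta^{T}).
$$

Hence, the methods to compute the two terms are
equivalent. Thus, we now focus on the computation
of $\mathbb{E}_{Q_{-il}} [\log P(A_{ij} \mid F_i, F_j, \Theta)]$.

First, in the case of $A_{ij} = 1$, by the definition of
$P(A_{ij} \mid F_i, F_j)$,

$$
\begin{aligned}
\mathbb{E}_{Q_{-il}} \left[ \log P(A_{ij} = 1 \mid F_i, F_j, \Theta) \right]
&= \mathbb{E}_{Q_{-il}} \left[ \sum_{k} \log \Theta_k [F_{ik}, F_{jk}] \right] \\
&= \mathbb{E}_{Q_{jl}} \left[ \log \Theta_l [F_{il}, F_{jl}] \right]
 + \sum_{k \neq l} \mathbb{E}_{Q_{ik,jk}} \left[ \log \Theta_k [F_{ik}, F_{jk}] \right] \\
&= \mathbb{E}_{Q_{jl}} \left[ \log \Theta_l [F_{il}, F_{jl}] \right] + C'
\end{aligned}
\tag{19}
$$

for some constant $C'$, where $Q_{ik,jk}(F_{ik}, F_{jk}) =
Q_{ik}(F_{ik})\, Q_{jk}(F_{jk})$, because $\mathbb{E}_{Q_{ik,jk}} [\log \Theta_k [F_{ik}, F_{jk}]]$ is
constant for each $k$.

Second, in the case of $A_{ij} = 0$,

$$
P(A_{ij} = 0 \mid F_i, F_j, \Theta) = 1 - \prod_{k} \Theta_k [F_{ik}, F_{jk}].
\tag{20}
$$

Since $\log P(A_{ij} \mid F_i, F_j, \Theta)$ is not separable in
terms of $\Theta_k$, it takes $O(2^{2L})$ time to compute
$\mathbb{E}_{Q_{-il}} [\log P(A_{ij} \mid F_i, F_j, \Theta)]$ exactly. We can reduce
this computation time to $O(L)$ by applying the Taylor
expansion $\log(1 - x) \approx -x - \frac{1}{2} x^2$ for small $x$:

$$
\begin{aligned}
\mathbb{E}_{Q_{-il}} \left[ \log P(A_{ij} = 0 \mid F_i, F_j, \Theta) \right]
&\approx \mathbb{E}_{Q_{-il}} \left[ - \prod_{k} \Theta_k [F_{ik}, F_{jk}]
   - \frac{1}{2} \prod_{k} \Theta_k^2 [F_{ik}, F_{jk}] \right] \\
&= - \mathbb{E}_{Q_{jl}} \left[ \Theta_l [F_{il}, F_{jl}] \right]
     \prod_{k \neq l} \mathbb{E}_{Q_{ik,jk}} \left[ \Theta_k [F_{ik}, F_{jk}] \right] \\
&\quad - \frac{1}{2}\, \mathbb{E}_{Q_{jl}} \left[ \Theta_l^2 [F_{il}, F_{jl}] \right]
     \prod_{k \neq l} \mathbb{E}_{Q_{ik,jk}} \left[ \Theta_k^2 [F_{ik}, F_{jk}] \right]
\end{aligned}
\tag{21}
$$

where each term can be computed by

$$
\begin{aligned}
\mathbb{E}_{Q_{jl}} \left[ Y_l [F_{il}, F_{jl}] \right]
&= \phi_{jl}\, Y_l [F_{il}, 0] + (1 - \phi_{jl})\, Y_l [F_{il}, 1] \\
\mathbb{E}_{Q_{ik,jk}} \left[ Y_k [F_{ik}, F_{jk}] \right]
&= [\phi_{ik} \;\; 1 - \phi_{ik}] \cdot Y_k \cdot [\phi_{jk} \;\; 1 - \phi_{jk}]^{T}
\end{aligned}
$$

for any matrices $Y_l, Y_k \in \mathbb{R}^{2 \times 2}$.

In brief, for fixed $i$ and $l$, we first compute
$\mathbb{E}_{Q_{-il}} [\log P(A_{ij} \mid F_i, F_j, \Theta)]$ for each node $j$,
depending on whether or not $i \to j$ is an edge. By adding
$\log P(F_{il} \mid \mu_l)$, we then obtain the value of $\log \tilde{P}_{il}(F_{il})$
for each $F_{il}$. Once we have $\log \tilde{P}_{il}(F_{il})$, we can finally
compute $\partial \mathcal{L}_Q / \partial \phi_{il}$.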
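The cost argument behind Eq. (21) can be checked numerically on a small case. The sketch below is an illustration under stated assumptions, not the MagFit implementation: it takes the expectation over all attributes of both nodes (fixing $F_{il}$ corresponds to making `q_i[l]` a one-hot vector), represents each $Q_{ik}$ as a pair `[Q(0), Q(1)]`, and uses made-up affinity values. The function names are hypothetical.

```python
import itertools
import math


def expect_exact(theta, q_i, q_j):
    """E[log(1 - prod_k theta[k][a_k][b_k])], enumerating all 2^(2L) cases."""
    L = len(theta)
    total = 0.0
    for a in itertools.product((0, 1), repeat=L):
        for b in itertools.product((0, 1), repeat=L):
            prob, x = 1.0, 1.0
            for k in range(L):
                prob *= q_i[k][a[k]] * q_j[k][b[k]]
                x *= theta[k][a[k]][b[k]]
            total += prob * math.log(1.0 - x)
    return total


def expect_taylor(theta, q_i, q_j):
    """Second-order surrogate -E[x] - E[x^2] / 2, computed in O(L).

    Independence across attributes lets E[prod_k ...] factor into a
    product of per-attribute expectations.
    """
    m1, m2 = 1.0, 1.0
    for th, qi, qj in zip(theta, q_i, q_j):
        m1 *= sum(qi[a] * qj[b] * th[a][b] for a in (0, 1) for b in (0, 1))
        m2 *= sum(qi[a] * qj[b] * th[a][b] ** 2 for a in (0, 1) for b in (0, 1))
    return -m1 - 0.5 * m2


# Hypothetical parameters: L = 4 attributes sharing one affinity matrix.
theta = [[[0.6, 0.3], [0.3, 0.5]] for _ in range(4)]
q_i = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5], [0.9, 0.1]]
q_j = [[0.2, 0.8], [0.5, 0.5], [0.6, 0.4], [0.3, 0.7]]

print("exact :", expect_exact(theta, q_i, q_j))
print("taylor:", expect_taylor(theta, q_i, q_j))
```

The approximation is tight here because the product of affinities is small (at most $0.6^4$); for link probabilities close to 1 the truncated series would be noticeably less accurate.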

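Scalable computation. However, as we analyzed in
Section 3, the above E-step algorithm requires $O(LN)$
time for each computation of $\partial \mathcal{L}_Q / \partial \phi_{il}$, so that the
total computation time is $O(L^2 N^2)$, which is infeasible
when the number of nodes $N$ is large.

Here we propose a scalable algorithm for computing
$\partial \mathcal{L}_Q / \partial \phi_{il}$ by further approximation. As described in
Section 3, we quickly approximate the value of $\partial \mathcal{L}_Q / \partial \phi_{il}$ as
if the network were empty, and adjust it by the
part where edges actually exist. To approximate $\partial \mathcal{L}_Q / \partial \phi_{il}$
in the empty network case, we reformulate the first term
in Eq. (17):

$$
\sum_{j} \mathbb{E}_{Q_{-il}} \left[ \log P(A_{ij} \mid F_i, F_j, \Theta) \right]
= \sum_{j} \mathbb{E}_{Q_{-il}} \left[ \log P(0 \mid F_i, F_j, \Theta) \right]
+ \sum_{A_{ij} = 1} \mathbb{E}_{Q_{-il}} \left[ \log P(1 \mid F_i, F_j, \Theta) - \log P(0 \mid F_i, F_j, \Theta) \right]
\tag{22}
$$

However, since the sum of i.i.d. random variables can
be approximated in terms of the expectation of the
random variable, the first term in Eq. (22) can be
approximated as follows:

$$
\begin{aligned}
\sum_{j} \mathbb{E}_{Q_{-il}} \left[ \log P(0 \mid F_i, F_j, \Theta) \right]
&= \mathbb{E}_{Q_{-il}} \left[ \sum_{j} \log P(0 \mid F_i, F_j, \Theta) \right] \\
&\approx \mathbb{E}_{Q_{-il}} \left[ (N - 1)\, \mathbb{E}_{F_j} \left[ \log P(0 \mid F_i, F_j, \Theta) \right] \right] \\
&= (N - 1)\, \mathbb{E}_{F_j} \left[ \log P(0 \mid F_i, F_j, \Theta) \right]
\end{aligned}
\tag{23}
$$

As $F_{jl}$ marginally follows a Bernoulli distribution with
$\mu_l$, we can compute Eq. (23) by using Eq. (21) in $O(L)$
time. Since the second term of Eq. (22) takes $O(L N_i)$
time, where $N_i$ represents the number of neighbors of
node $i$, Eq. (22) takes only $O(L N_i)$ time in total. As
in the E-step we do this operation by iterating over all
$i$'s and $l$'s, the total computation time of the E-step
eventually becomes $O(L^2 E)$, which is feasible in many
large-scale networks.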
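The decomposition in Eq. (22) is an exact identity; only the replacement of the empty-graph term by $(N - 1)$ times an expectation in Eq. (23) is an approximation. The sketch below checks the identity on a small synthetic graph with known attributes. All names, sizes, and affinity values are assumptions for illustration, and the empty-graph term is computed exactly here rather than approximated.

```python
import math
import random


def edge_prob(f_i, f_j, theta):
    """MAG link probability: product of per-attribute affinities."""
    p = 1.0
    for th, a, b in zip(theta, f_i, f_j):
        p *= th[a][b]
    return p


def loglik_row_naive(i, attrs, out_edges, theta):
    """Sum over all j != i of log P(A_ij | F_i, F_j): O(L N) work."""
    total = 0.0
    for j in range(len(attrs)):
        if j == i:
            continue
        p = edge_prob(attrs[i], attrs[j], theta)
        total += math.log(p) if j in out_edges[i] else math.log(1.0 - p)
    return total


def loglik_row_split(i, attrs, out_edges, theta, empty_term):
    """Empty-graph term plus a correction over the out-neighbors of i.

    Given `empty_term`, the loop touches only the N_i neighbors of i.
    """
    total = empty_term
    for j in out_edges[i]:
        p = edge_prob(attrs[i], attrs[j], theta)
        total += math.log(p) - math.log(1.0 - p)
    return total


# Small synthetic instance (hypothetical sizes and affinities).
random.seed(0)
N, L = 30, 3
theta = [[[0.9, 0.4], [0.4, 0.6]] for _ in range(L)]
attrs = [tuple(random.randint(0, 1) for _ in range(L)) for _ in range(N)]
out_edges = {
    i: {j for j in range(N)
        if j != i and random.random() < edge_prob(attrs[i], attrs[j], theta)}
    for i in range(N)
}

# Exact empty-graph term for node 0; Eq. (23) would replace this sum by
# (N - 1) times an expectation over F_j.
empty = sum(math.log(1.0 - edge_prob(attrs[0], attrs[j], theta))
            for j in range(1, N))
print(loglik_row_naive(0, attrs, out_edges, theta))
print(loglik_row_split(0, attrs, out_edges, theta, empty))
```

Once the empty-graph term is available in $O(L)$ time, the per-node cost is governed by the number of neighbors rather than by $N$, which is where the $O(L^2 E)$ total comes from.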

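Computation of $\partial \mathrm{MI} / \partial \phi_{il}$. Now we turn our attention to
the derivative of the mutual information term. Since
$\mathrm{MI}(F) = \sum_{l \neq l'} \mathrm{MI}_{ll'}$, we can separately compute the
derivative of each term $\partial \mathrm{MI}_{ll'} / \partial \phi_{il}$. By the definition
of $\mathrm{MI}_{ll'}$ and the chain rule,

$$
\begin{aligned}
\frac{\partial \mathrm{MI}_{ll'}}{\partial \phi_{il}}
&= \frac{\partial}{\partial \phi_{il}} \sum_{x, y \in \{0, 1\}}
   p_{ll'}(x, y) \log \frac{p_{ll'}(x, y)}{p_l(x)\, p_{l'}(y)} \\
&= \sum_{x, y \in \{0, 1\}} \left(
   \frac{\partial p_{ll'}(x, y)}{\partial \phi_{il}} \log \frac{p_{ll'}(x, y)}{p_l(x)\, p_{l'}(y)}
   + \frac{\partial p_{ll'}(x, y)}{\partial \phi_{il}}
   - \frac{p_{ll'}(x, y)}{p_l(x)} \frac{\partial p_l(x)}{\partial \phi_{il}}
   - \frac{p_{ll'}(x, y)}{p_{l'}(y)} \frac{\partial p_{l'}(y)}{\partial \phi_{il}}
   \right)
\end{aligned}
\tag{24}
$$

The values of $p_{ll'}(x, y)$, $p_l(x)$, and $p_{l'}(y)$ are defined
in Eq. (9). Therefore, in order to compute $\partial \mathrm{MI}_{ll'} / \partial \phi_{il}$, we
need the values of $\partial p_{ll'}(x, y) / \partial \phi_{il}$, $\partial p_l(x) / \partial \phi_{il}$,
and $\partial p_{l'}(y) / \partial \phi_{il}$. By the definitions in Eq. (9), each of
these derivatives is expressed through $\partial Q_{il} / \partial \phi_{il}$,
where $\left. \frac{\partial Q_{il}}{\partial \phi_{il}} \right|_{F_{il} = 0} = 1$
and $\left. \frac{\partial Q_{il}}{\partial \phi_{il}} \right|_{F_{il} = 1} = -1$.

Since all terms in $\partial \mathrm{MI}_{ll'} / \partial \phi_{il}$ are tractable, we can
eventually compute $\partial \mathrm{MI} / \partial \phi_{il}$.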
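For a pair of binary attributes, the mutual information term is a sum over the four cells of a 2x2 joint distribution. The sketch below computes it directly. How the joint table is built from the variational parameters is an assumption here (an average over nodes of $Q_{il}(x)\, Q_{il'}(y)$ with $Q(0) = \phi$), since the defining equation lies outside this appendix; both function names are hypothetical.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a 2x2 joint distribution joint[x][y]."""
    px = [sum(joint[x]) for x in (0, 1)]
    py = [joint[0][y] + joint[1][y] for y in (0, 1)]
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p = joint[x][y]
            if p > 0.0:  # 0 * log 0 is taken as 0
                mi += p * math.log(p / (px[x] * py[y]))
    return mi


def joint_from_phi(phi_l, phi_m):
    """Node-averaged joint of two attributes, taking Q(0) = phi, Q(1) = 1 - phi.

    Averaging over nodes is an assumption made for this sketch.
    """
    n = len(phi_l)
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for a, b in zip(phi_l, phi_m):
        qa, qb = (a, 1.0 - a), (b, 1.0 - b)
        for x in (0, 1):
            for y in (0, 1):
                joint[x][y] += qa[x] * qb[y] / n
    return joint


# Two attributes that vary together across nodes carry shared information.
print(mutual_information(joint_from_phi([0.9, 0.1], [0.9, 0.1])))
# An attribute that is uninformative at every node shares none.
print(mutual_information(joint_from_phi([0.5, 0.5], [0.2, 0.8])))
```

Penalizing this quantity pushes different latent attributes to capture different aspects of the network instead of duplicating one another.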

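A.2 Variational M-Step

In the E-step, with given model parameters $\mu$ and $\Theta$,
we updated the variational parameter $\phi$ to maximize
$\mathcal{L}_Q(\mu, \Theta)$ as well as to minimize the mutual
information between every pair of attributes. In the M-step,
we fix the approximate posterior distribution
$Q(F)$, i.e., fix the variational parameter $\phi$, and update
the model parameters $\mu$ and $\Theta$ to maximize $\mathcal{L}_Q(\mu, \Theta)$.

We reformulate $\mathcal{L}_Q(\mu, \Theta)$ as

$$
\begin{aligned}
\mathcal{L}_Q(\mu, \Theta)
&= \mathbb{E}_Q \left[ \log P(A, F \mid \mu, \Theta) - \log Q(F) \right] \\
&= \mathbb{E}_Q \left[ \sum_{i,j} \log P(A_{ij} \mid F_i, F_j, \Theta)
   + \sum_{i,l} \log P(F_{il} \mid \mu_l) \right] + \mathcal{H}(Q) \\
&= \sum_{i,j} \mathbb{E}_{Q_{i,j}} \left[ \log P(A_{ij} \mid F_i, F_j, \Theta) \right]
   + \sum_{l} \left( \sum_{i} \mathbb{E}_{Q_{il}} \left[ \log P(F_{il} \mid \mu_l) \right] \right)
   + \mathcal{H}(Q)
\end{aligned}
\tag{25}
$$

where $Q_{i,j}(F_{i\cdot}, F_{j\cdot})$ represents $\prod_{l} Q_{il}(F_{il})\, Q_{jl}(F_{jl})$.

Thus $\mathcal{L}_Q(\mu, \Theta)$ in Eq. (25) is divided into the
following terms: a function of $\Theta$, a function of $\mu_l$,
and a constant. Hence, we can update $\mu$ and $\Theta$
separately. Since we already showed how to update $\mu$
in Section 3, here we focus on the maximization of
$\mathcal{L}_\Theta = \mathbb{E}_Q [\log P(A, F \mid \mu, \Theta) - \log Q(F)]$ using the
gradient method.

Computation of $\nabla_{\Theta_l} \mathcal{L}_\Theta$. To use the gradient
method, we need to compute the gradient of $\mathcal{L}_\Theta$:

$$
\nabla_{\Theta_l} \mathcal{L}_\Theta
= \sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}} \left[ \log P(A_{ij} \mid F_i, F_j, \Theta) \right].
\tag{26}
$$

We separately calculate the gradient of each term in
$\mathcal{L}_\Theta$ as follows. For every $z_1, z_2 \in \{0, 1\}$, if $A_{ij} = 1$,

$$
\begin{aligned}
\frac{\partial\, \mathbb{E}_{Q_{i,j}} \left[ \log P(A_{ij} \mid F_i, F_j, \Theta) \right]}{\partial \Theta_l [z_1, z_2]}
&= \frac{\partial}{\partial \Theta_l [z_1, z_2]}
   \mathbb{E}_{Q_{i,j}} \left[ \sum_{k} \log \Theta_k [F_{ik}, F_{jk}] \right] \\
&= \frac{\partial}{\partial \Theta_l [z_1, z_2]}
   \mathbb{E}_{Q_{il,jl}} \left[ \log \Theta_l [F_{il}, F_{jl}] \right] \\
&= \frac{Q_{il}(z_1)\, Q_{jl}(z_2)}{\Theta_l [z_1, z_2]}.
\end{aligned}
\tag{27}
$$

On the contrary, if $A_{ij} = 0$, we use the Taylor expansion
as used in Eq. (21):

$$
\begin{aligned}
\frac{\partial\, \mathbb{E}_{Q_{i,j}} \left[ \log P(A_{ij} \mid F_i, F_j, \Theta) \right]}{\partial \Theta_l [z_1, z_2]}
&\approx \frac{\partial}{\partial \Theta_l [z_1, z_2]}
   \mathbb{E}_{Q_{i,j}} \left[ - \prod_{k} \Theta_k [F_{ik}, F_{jk}]
   - \frac{1}{2} \prod_{k} \Theta_k^2 [F_{ik}, F_{jk}] \right] \\
&= - Q_{il}(z_1)\, Q_{jl}(z_2) \prod_{k \neq l} \mathbb{E}_{Q_{ik,jk}} \left[ \Theta_k [F_{ik}, F_{jk}] \right] \\
&\quad - Q_{il}(z_1)\, Q_{jl}(z_2)\, \Theta_l [z_1, z_2]
   \prod_{k \neq l} \mathbb{E}_{Q_{ik,jk}} \left[ \Theta_k^2 [F_{ik}, F_{jk}] \right]
\end{aligned}
\tag{28}
$$

where $Q_{il,jl}(F_{il}, F_{jl}) = Q_{il}(F_{il})\, Q_{jl}(F_{jl})$.
Since

$$
\mathbb{E}_{Q_{ik,jk}} \left[ f(\Theta_k) \right]
= \sum_{z_1, z_2 \in \{0, 1\}} Q_{ik}(z_1)\, Q_{jk}(z_2)\, f(\Theta_k [z_1, z_2])
$$

for any function $f$, and we know each value
of $Q_{il}(F_{il})$ in terms of $\phi_{il}$, we are able to obtain the
gradient $\nabla_{\Theta_l} \mathcal{L}_\Theta$ by Eqs. (26)-(28).

Scalable computation. The M-step requires summing
$O(N^2)$ terms in Eq. (26), where each term takes $O(L)$
time to compute. Similarly to the E-step, here we
propose a scalable algorithm that separates Eq. (26)
into two parts, the fixed part for an empty graph and
the adjustment part for the actual edges:

$$
\nabla_{\Theta_l} \mathcal{L}_\Theta
= \sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}} \left[ \log P(0 \mid F_i, F_j, \Theta) \right]
+ \sum_{A_{ij} = 1} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}} \left[ \log P(1 \mid F_i, F_j, \Theta) - \log P(0 \mid F_i, F_j, \Theta) \right]
\tag{29}
$$

We are able to approximate the first term in Eq. (29),
the value for the empty graph part, as follows:

$$
\begin{aligned}
\sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}} \left[ \log P(0 \mid F_i, F_j, \Theta) \right]
&= \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}} \left[ \sum_{i,j} \log P(0 \mid F_i, F_j, \Theta) \right] \\
&\approx \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}} \left[ N(N - 1)\, \mathbb{E}_{F} \left[ \log P(0 \mid F, \Theta) \right] \right] \\
&= \nabla_{\Theta_l} N(N - 1)\, \mathbb{E}_{F} \left[ \log P(0 \mid F, \Theta) \right].
\end{aligned}
\tag{30}
$$

Since each $F_{il}$ marginally follows the Bernoulli
distribution with $\mu_l$, Eq. (30) is computed in
$O(L)$ time using the Taylor approximation above. As the
second term in Eq. (29) requires only $O(LE)$ time, the
computation time of the M-step is finally reduced to
$O(LE)$ time.

Figure 7: The recovered network properties by the
MAG model and the Kronecker graphs model on the
Yahoo!-Answers network (panels include (e) clustering
coefficient and (f) triad participation). For every network
property, the MAG model outperforms the Kronecker graphs model.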
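The per-pair gradients in Eqs. (27) and (28) are easy to get wrong by a sign or a factor, so a finite-difference check is a useful safeguard. The sketch below implements both cases for a single node pair and compares the non-edge gradient against a numerical derivative of the Taylor surrogate. Each $Q$ is given as a pair `[Q(0), Q(1)]`; the function names and parameter values are hypothetical.

```python
import copy


def moments(theta, q_i, q_j, power):
    """Per-attribute E[theta_k[F_ik, F_jk] ** power] under independent Q."""
    return [
        sum(qi[a] * qj[b] * th[a][b] ** power for a in (0, 1) for b in (0, 1))
        for th, qi, qj in zip(theta, q_i, q_j)
    ]


def prod_except(values, skip):
    """Product of all entries except index `skip` (use -1 to skip none)."""
    out = 1.0
    for k, v in enumerate(values):
        if k != skip:
            out *= v
    return out


def nonedge_objective(theta, q_i, q_j):
    """Taylor surrogate for E[log P(A_ij = 0)]: -E[x] - E[x^2] / 2."""
    m1 = moments(theta, q_i, q_j, 1)
    m2 = moments(theta, q_i, q_j, 2)
    return -prod_except(m1, -1) - 0.5 * prod_except(m2, -1)


def nonedge_gradient(theta, q_i, q_j, l, z1, z2):
    """Derivative of the surrogate w.r.t. theta[l][z1][z2], as in Eq. (28)."""
    m1 = moments(theta, q_i, q_j, 1)
    m2 = moments(theta, q_i, q_j, 2)
    w = q_i[l][z1] * q_j[l][z2]
    return -w * prod_except(m1, l) - w * theta[l][z1][z2] * prod_except(m2, l)


def edge_gradient(theta, q_i, q_j, l, z1, z2):
    """Derivative of E[log P(A_ij = 1)] w.r.t. theta[l][z1][z2], as in Eq. (27)."""
    return q_i[l][z1] * q_j[l][z2] / theta[l][z1][z2]


# Hypothetical parameters for one node pair with L = 2 attributes.
theta = [[[0.6, 0.3], [0.3, 0.5]], [[0.7, 0.2], [0.4, 0.6]]]
q_i = [[0.7, 0.3], [0.4, 0.6]]
q_j = [[0.2, 0.8], [0.5, 0.5]]

# Central finite difference on theta[1][0][1].
eps = 1e-6
hi, lo = copy.deepcopy(theta), copy.deepcopy(theta)
hi[1][0][1] += eps
lo[1][0][1] -= eps
numeric = (nonedge_objective(hi, q_i, q_j) - nonedge_objective(lo, q_i, q_j)) / (2 * eps)
print(numeric, nonedge_gradient(theta, q_i, q_j, 1, 0, 1))
```

The surrogate is quadratic in each individual entry of $\Theta_l$, so the central difference agrees with the analytic value up to rounding error.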



B.1 Yahoo!-Answers Network

Here we add some experimental results that we omitted
in Section 4. First, Figure 7 compares the six
network properties of the Yahoo!-Answers network and
the synthetic networks generated by the MAG model and
the Kronecker graphs model fitted to the real network.
The MAG model in general shows better performance
than the Kronecker graphs model. In particular, the
MAG model greatly outperforms the Kronecker graphs
model in local-clustering properties (clustering
coefficient and triad participation).

Second, to quantify the recovery of the network
properties, we show the KS and L2 statistics for the
synthetic networks generated by the MAG model and the
Kronecker graphs model in Table 6. Table 6 confirms
the visual inspection of Figure 7. The
MAG model shows better statistics than the Kronecker
graphs model overall, and there is a large improvement
in the local-clustering properties.



Table 6: KS and L2 for MAG and Kronecker model
fitted to Yahoo!-Answers network

KS      InD    OutD   SVal    SVec    TP     CCF    Avg
MAG     3.00   2.80   14.93   13.72   4.84   4.80   7.35
Kron    2.00   5.78   13.56   15.47   7.98   7.05   8.64

L2      InD    OutD   SVal    SVec    TP     CCF    Avg
MAG     0.96   0.74   0.70    6.81    2.76   2.39   2.39
Kron    0.81   2.24   0.69    7.41    6.14   4.73   3.67



Table 7: KS and L2 for logistic regression methods
fitted to AddHealth network

KS      InD    OutD   SVal   SVec   TP     CCF    Avg
R7      2.00   2.58   0.58   3.03   5.39   5.91   3.24
F7      1.59   1.59   0.52   3.03   5.43   5.91   3.00

L2      InD    OutD   SVal   SVec   TP     CCF    Avg
R7      0.54   0.58   0.29   1.09   3.43   2.42   1.39
F7      0.42   0.24   0.27   1.12   3.55   2.09   1.28



B.2 AddHealth Network 

We briefly mentioned the logistic regression method in
the AddHealth network experiment. Here we provide the
details of the logistic regression and its full
experimental results.

For the variables of the logistic regression, we use a set
of real attributes in the AddHealth network dataset.
For this set of attributes, we used F7 (forward
selection) and R7 (random selection) as defined in Section 4.
Once the set of attributes is fixed, we use the following
linear model:

$$
P(i \to j) = \frac{1}{1 + \exp \left( c + \sum_{l} \alpha_l F_{il} + \sum_{l} \beta_l F_{jl} \right)}.
$$

Table 7 shows the KS and L2 statistics for logistic
regression methods under the R7 and F7 attribute sets.
Logistic regression appears to succeed in the
recovery of degree distributions. However, it fails to
recover the local-clustering properties (clustering
coefficient and triad participation) for both sets.
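One way to see why an additive logistic model struggles with local clustering is that it cannot represent an interaction between the two endpoints' attribute values. The sketch below assumes the link function has the form $1 / (1 + \exp(\cdot))$ with an intercept and separate per-attribute coefficients for the source and target node; the names `c`, `alpha`, `beta` and their values are hypothetical. For a single binary attribute, the log-odds of the two same-value pairs always sum to the log-odds of the two mixed pairs, so a homophily pattern (both same-value pairs more likely than both mixed pairs) is out of reach.

```python
import math


def logistic_link_prob(f_i, f_j, c, alpha, beta):
    """Edge probability under an additive logistic model of the attributes."""
    z = c
    z += sum(a * f for a, f in zip(alpha, f_i))  # source-node terms
    z += sum(b * f for b, f in zip(beta, f_j))   # target-node terms
    return 1.0 / (1.0 + math.exp(z))


def logit(p):
    return math.log(p / (1.0 - p))


# Single binary attribute with arbitrary (hypothetical) coefficients.
c, alpha, beta = -1.0, [0.8], [-0.3]
p = {
    (a, b): logistic_link_prob([a], [b], c, alpha, beta)
    for a in (0, 1)
    for b in (0, 1)
}

same = logit(p[0, 0]) + logit(p[1, 1])
mixed = logit(p[0, 1]) + logit(p[1, 0])
print(same, mixed)  # equal up to rounding, whatever the coefficients
```

A 2x2 affinity matrix has no such constraint, which is consistent with MAG recovering clustering coefficient and triad participation where logistic regression does not.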



